Speech synthesis using concatenation of speech waveforms

ABSTRACT

A method of synthesizing a speech signal by providing a first speech unit signal having an end interval and a second speech unit signal having a front interval, wherein at least some of the periods of the end interval are appended in inverted order at the end of the first speech unit signal in order to provide a fade-out interval, and at least some of the periods of the front interval are appended in inverted order at the beginning of the second speech unit signal to provide a fade-in interval. An overlap and add operation is performed on the end and fade-in intervals and the fade-out and front intervals.

The present invention relates to the field of synthesizing speech or music and, more particularly and without limitation, to the field of text-to-speech synthesis.

The function of a text-to-speech (TTS) synthesis system is to synthesize speech from a generic text in a given language. Nowadays, TTS systems have been put into practical operation for many applications, such as access to databases through the telephone network or aid to handicapped people. One method to synthesize speech is by concatenating elements of a recorded set of subunits of speech such as demi-syllables or polyphones. The majority of successful commercial systems employ the concatenation of polyphones.

The polyphones comprise groups of two (diphones), three (triphones) or more phones and may be determined from nonsense words, by segmenting the desired grouping of phones at stable spectral regions. In a concatenation-based synthesis, the conservation of the transition between two adjacent phones is crucial to assure the quality of the synthesized speech. With the choice of polyphones as the basic subunits, the transition between two adjacent phones is preserved in the recorded subunits, and the concatenation is carried out between similar phones.

Before the synthesis, however, the phones must have their duration and pitch modified in order to fulfill the prosodic constraints of the new words containing those phones. This processing is necessary to avoid the production of a monotonous sounding synthesized speech. In a TTS system, this function is performed by a prosodic module. To allow the duration and pitch modifications in the recorded subunits, many concatenation-based TTS systems employ the time-domain pitch-synchronous overlap-add (TD-PSOLA) model of synthesis (E. Moulines and F. Charpentier, “Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Commun., vol. 9, pp. 453-467, 1990).

In the TD-PSOLA model, the speech signal is first submitted to a pitch marking algorithm. This algorithm assigns marks at the peaks of the signal in the voiced segments and assigns marks 10 ms apart in the unvoiced segments. The synthesis is made by a superposition of Hanning-windowed segments centered at the pitch marks and extending from the previous pitch mark to the next one. The duration modification is provided by deleting or replicating some of the windowed segments. The pitch period modification, on the other hand, is provided by increasing or decreasing the superposition between windowed segments.
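As a rough illustration of the overlap-add scheme just described, the following Python sketch assumes a voiced signal with a constant pitch period, so that the pitch marks are simply `period` samples apart. The function name `td_psola_uniform` and its parameters are hypothetical and are not taken from the cited paper; a real pitch marker would place marks at signal peaks and handle unvoiced segments separately.

```python
import math

def td_psola_uniform(signal, period, new_period, repeat=1):
    """Simplified TD-PSOLA sketch for a signal with constant pitch period.

    Assumptions (not from the source): pitch marks lie at multiples of
    `period`; `new_period` sets the spacing of the segments on output
    (pitch change); `repeat` replicates every segment (duration change).
    """
    # Analysis pitch marks, leaving room for one period on each side.
    marks = range(period, len(signal) - period + 1, period)
    # Hanning window spanning previous mark .. next mark (two periods).
    window = [0.5 - 0.5 * math.cos(math.pi * i / period)
              for i in range(2 * period)]
    # Windowed segments centered at the pitch marks.
    segments = [[signal[m - period + i] * window[i] for i in range(2 * period)]
                for m in marks]
    # Duration modification: replicate each windowed segment.
    segments = [seg for seg in segments for _ in range(repeat)]
    # Overlap-add with the requested spacing between segments.
    out = [0.0] * (new_period * (len(segments) - 1) + 2 * period)
    for k, seg in enumerate(segments):
        start = k * new_period
        for i, value in enumerate(seg):
            out[start + i] += value
    return out

# With unchanged spacing and no replication, the overlapping Hanning
# windows sum to one, so the interior of the signal is reproduced.
sig = [math.sin(2 * math.pi * i / 8) for i in range(64)]
same = td_psola_uniform(sig, period=8, new_period=8)
longer = td_psola_uniform(sig, period=8, new_period=8, repeat=2)
```

Choosing `new_period` smaller than `period` increases the superposition between segments and raises the pitch; a larger value lowers it.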

Despite the success achieved in many commercial TTS systems, the synthetic speech produced by using the TD-PSOLA model of synthesis can present some drawbacks, mainly under large prosodic variations.

Examples of such PSOLA methods are those defined in documents EP-0363233, U.S. Pat. No. 5,479,564 and EP-0706170. A specific example is also the MBR-PSOLA method as published by T. Dutoit and H. Leich in Speech Communication, Elsevier Publisher, November 1993, vol. 13, no. 3-4. The method described in document U.S. Pat. No. 5,479,564 suggests a means of modifying the frequency by overlap-adding short-term signals extracted from this signal. The length of the weighting windows used to obtain the short-term signals is approximately equal to two times the period of the audio signal, and their position within the period can be set to any value (provided the time shift between successive windows is equal to the period of the audio signal). Document U.S. Pat. No. 5,479,564 also describes a means of interpolating waveforms between segments to concatenate, so as to smooth out discontinuities.

In prior art text-to-speech systems a set of pre-recorded speech fragments can be concatenated in a specific order to convert a certain text into natural sounding speech. Text-to-speech systems that use small speech fragments have many such concatenation points. Especially when the speech fragments are spectrally different, these joins produce artifacts that reduce the intelligibility. In particular, when two speech segments from different recording times are to be concatenated, the resulting speech can have a discontinuity at the joint of the two segments. For example, when a vowel is synthesized, the left part mostly comes from a different recording than the right part. This makes it impossible to reproduce the exact color of a vowel.

The slight differences in the formant trajectories produce a sudden jump at the joint location. What is mostly done in the prior art to reduce this effect is to re-record the speech fragment until it matches with the rest, or to add different versions (extra fragments) to minimize the difference.

The present invention therefore aims to provide an improved method of synthesizing a speech signal, the speech signal having at least a first diphone and a second diphone. The present invention further aims to provide a corresponding computer program product and computer system, in particular a text-to-speech system.

The present invention provides for a method of synthesizing a speech signal based on first and second diphone signals which are superposed at their joint. The invention enables a smooth concatenation of the diphone signals without any audible artifacts. This is accomplished by appending periods of an end interval of the first diphone signal in inverted order at the end of the first diphone signal, which provides a fade-out interval, and by appending periods of a front interval of the second diphone signal in inverted order at the beginning of the second diphone signal, which provides a fade-in interval. The end and fade-in intervals, and likewise the fade-out and front intervals, are overlapped to produce the smooth transition.

In accordance with an embodiment of the invention the end and front intervals of the first and second diphone signal are identified by a marker. Preferably the end and front intervals contain periods which are about steady, i.e. which have approximately the same information content and signal form. Such end and front intervals can be identified by a human expert or by means of a corresponding computer program. Preferably the first analysis is performed by means of a computer program and the result is reviewed by a human expert for increased precision.

In accordance with a further embodiment of the invention the last period of the end interval and the first period of the front interval are not appended. This has the advantage that no periodicity is introduced into the signal by the immediate repetition of two identical periods.

In accordance with a further embodiment of the invention a windowing operation is performed on the end and front intervals as well as on the respective appended periods by means of fade-out and fade-in windows, respectively. Preferably a raised cosine window function is used for voiced end intervals and the appended periods, whereas for unvoiced end intervals and the appended periods a sine window is used as a fade-out window. Likewise a raised cosine is used as a window function for smoothening the beginning of a voiced segment of the second diphone, or a sine window for unvoiced segments.

In accordance with an embodiment of the invention a duration adaptation is performed for the intervals to be overlapped. Especially if the intervals have different durations, this is advantageous in order to avoid the introduction of abrupt signal transitions.
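The text does not state how the durations are adapted. Purely as an illustration, the following Python sketch rescales two intervals to a common length by linear interpolation of their samples; the function names `resample_linear` and `adapt_durations` and the choice of the mean length as target are assumptions. Plain resampling also shifts the pitch of the rescaled interval, so a practical system would more likely use a pitch-synchronous method.

```python
def resample_linear(samples, target_len):
    """Rescale a list of samples to target_len samples (linear interpolation)."""
    if target_len <= 0 or not samples:
        return []
    if len(samples) == 1 or target_len == 1:
        return [samples[0]] * target_len
    scale = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out

def adapt_durations(interval_a, interval_b):
    """Bring two intervals to an equal length (assumed target: mean length)."""
    target = max(1, round((len(interval_a) + len(interval_b)) / 2))
    return resample_linear(interval_a, target), resample_linear(interval_b, target)

# Hypothetical example: a 4-sample interval and an 8-sample interval.
end_interval = [0.0, 1.0, 2.0, 3.0]
fade_in_interval = [0.0] * 8
end_adapted, fade_in_adapted = adapt_durations(end_interval, fade_in_interval)
```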

In accordance with a further embodiment of the invention, text-to-speech processing is performed by concatenating diphones in accordance with the principles of the present invention. This way a natural sounding speech output can be produced.

It is important to note that the present invention is not restricted to the concatenation of diphones but can also be advantageously employed for the concatenation of other speech units such as triphones, polyphones or words.

In the following, embodiments of the invention are described in greater detail by making reference to the drawings, in which:

FIG. 1 depicts a flow chart of a preferred embodiment of a method of the invention,

FIG. 2 depicts the interleaved repetition of periods at the end and the front of the original diphone signals,

FIG. 3 depicts an example for a signal synthesis, and

FIG. 4 depicts a block diagram of an embodiment of a text-to-speech system.

FIG. 1 shows a flow diagram which illustrates a preferred embodiment of a method of the present invention. In step 100 a first diphone signal A is provided. The diphone signal A has at least one marker which identifies an end interval of the diphone signal A.

In step 102 periods within the end interval of the diphone signal A are repeated in inverted order in order to provide a fade-out interval which is appended at the end of the end interval. In step 104 the end interval and its appended fade-out interval are windowed by means of a fade-out window function in order to smoothly fade out the diphone signal at its end.

Likewise a diphone signal B is provided in step 106. The diphone signal B has at least one associated marker in order to identify a front interval of the diphone signal B. In step 108 at least some of the periods of the front interval are appended at the beginning of the front interval of the diphone signal B in inverted order. This way a fade-in interval is provided. In step 110 the front interval and the appended fade-in interval are windowed by means of a fade-in window. This way a smooth beginning of the diphone signal B is provided.

In step 112 a duration adaptation is performed. This means that the durations of the end and front intervals of the diphone signals A and B are modified such that the end and fade-in intervals have the same duration. Likewise the durations of the fade-out and front intervals are adapted. In step 114 an overlap and add operation is performed on the diphone signals A and B with the processed end and fade-in intervals and the fade-out and front intervals. This way a smooth concatenation of the diphone signals A and B is accomplished.

For voiced segments usage of the following raised cosine window function is preferred:

$w[n] = 0.5 - 0.5 \cdot \cos\left( \frac{\pi \cdot (n + 0.5)}{m} \right), \quad 0 \leq n < m$

where m is the total number of periods in the smoothing range.

For unvoiced segments, a sine window is used:

$w[n] = \sin\left( \frac{0.5 \cdot \pi \cdot (n + 0.5)}{m} \right), \quad 0 \leq n < m$

The advantage of using a sine window is that it ensures that the total signal envelope in the power domain remains constant. Unlike with a periodic signal, when two noise samples are added, the absolute value of the sum can be smaller than the absolute value of either of the two samples. This is because the signals are (mostly) not in phase. The sine window adjusts for this effect and removes the envelope modulation.
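The two window functions can be evaluated directly from the formulas above. The following Python sketch is an illustration only: it assumes that the fade-out window is the time-reversed fade-in window, which the text does not state explicitly, and the function names are hypothetical. Under that assumption the raised cosine weights of the two overlapped signals sum to one in amplitude, while the sine weights sum to one in power.

```python
import math

def raised_cosine_window(m):
    """w[n] = 0.5 - 0.5*cos(pi*(n + 0.5)/m) for 0 <= n < m (voiced)."""
    return [0.5 - 0.5 * math.cos(math.pi * (n + 0.5) / m) for n in range(m)]

def sine_window(m):
    """w[n] = sin(0.5*pi*(n + 0.5)/m) for 0 <= n < m (unvoiced)."""
    return [math.sin(0.5 * math.pi * (n + 0.5) / m) for n in range(m)]

def reversed_window(window):
    # Assumption: the fade-out window mirrors the fade-in window in time.
    return window[::-1]

m = 8  # hypothetical number of periods in the smoothing range

rc_in = raised_cosine_window(m)
rc_out = reversed_window(rc_in)
# Voiced case: the amplitudes of fade-in and fade-out weights add up to 1.
amplitude_sums = [a + b for a, b in zip(rc_in, rc_out)]

sin_in = sine_window(m)
sin_out = reversed_window(sin_in)
# Unvoiced case: the squared weights (power) add up to 1.
power_sums = [a * a + b * b for a, b in zip(sin_in, sin_out)]
```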

FIG. 2 illustrates the process of appending interval periods in inverted order (cf. steps 102 and 108 of FIG. 1). Time axis 200 illustrates the time domain of diphone signal A. The diphone signal A has an end interval 202 which contains periods p₁, p₂, . . . , p_(i), . . . , p_(N−1), p_(N). In order to provide fade-out interval 204, periods p_(i) of the end interval 202 are appended at the end of the end interval 202 in inverted order. The last period p_(N) of the end interval 202 is not appended in order to avoid a repetition of two identical periods, which would introduce an unintended periodicity. Such a periodicity could become audible under certain circumstances. It is therefore preferred not to repeat the last period p_(N) of the end interval 202. The first period p′₁ of the fade-out interval 204 is provided by copying the signal of period p_(N−1). In general, period p′_(j) of fade-out interval 204 is obtained by copying period p_(N−j) from the end interval 202, i.e. p′_(j)=p_(N−j).

Time axis 206 is illustrative of the time domain of diphone signal B. Diphone signal B has a front interval 208 containing periods P₁, P₂, . . . , P_(i), . . . , P_(N−1), P_(N). Fade-in interval 210 is provided by appending periods from front interval 208 at the beginning of front interval 208 in inverted order. Again it is preferred not to append the first period P₁ of the front interval 208, to avoid the introduction of unintended periodicity. In the general case a signal period P′_(j) is obtained from the period P_(N−j+1) of the front interval 208, i.e. P′_(j)=P_(N−j+1).

For concatenating the diphone signal A and the diphone signal B, the end interval 202 and the fade-in interval 210 are overlapped and added, as well as the fade-out interval 204 and the front interval 208. In the example considered here this can be done without adapting the durations of the respective intervals, as the durations of the end interval 202 and the fade-in interval 210 as well as the durations of the fade-out interval 204 and the front interval 208 are the same.
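The index relations p′_(j)=p_(N−j) and P′_(j)=P_(N−j+1) amount to reversing the interval and leaving out one boundary period. A minimal Python sketch, assuming each interval is available as a list of periods (represented here by labels rather than sample data) and using hypothetical function names:

```python
def make_fade_out(end_periods):
    """Fade-out periods p'_j = p_(N-j), j = 1..N-1.

    The end interval is reversed and its last period p_N is left out.
    """
    return end_periods[-2::-1]

def make_fade_in(front_periods):
    """Fade-in periods P'_j = P_(N-j+1), j = 1..N-1.

    The front interval is reversed and its first period P_1 is left out.
    """
    return front_periods[:0:-1]

# Hypothetical intervals with N = 4 periods each.
end_interval = ["p1", "p2", "p3", "p4"]
front_interval = ["P1", "P2", "P3", "P4"]

# Signal A: end interval followed by the fade-out interval.
extended_a = end_interval + make_fade_out(end_interval)
# Signal B: fade-in interval followed by the front interval.
extended_b = make_fade_in(front_interval) + front_interval
```

In this sketch no period is immediately followed by a copy of itself: `p4` is followed by `p3`, and `P2` is followed by `P1`.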

FIG. 3 shows an example of the various synthesis steps for the word ‘young’. This word is made of the phonemes /j/, /V/, /N/ and the silence /_/. Diagrams (a) and (b) show the recorded nonsense words that contain the transitions from /j/ to /V/ and from /V/ to /N/. Within each nonsense word five markers are placed. The outer markers are the diphone borders (labels j-, -V, V- and -N). The markers in the middle show where a new phoneme starts (labels V and N). The other labels are used to mark the segments that will be used for overlap-add.

As illustrated in diagram (c) of FIG. 3, the periods of the end interval 300 are repeated in inverted order to provide a fade-out interval 302. All the periods within end interval 300 are appended after period 304, which is the last period of the end interval 300. Period 304 itself is not appended, to avoid the repetition of the same period, which would introduce an unintended periodicity. Likewise, for the diphone signal of diagram (b) of FIG. 3 the periods within front interval 306 are appended at the beginning of the front interval 306 in inverted order to provide a fade-in interval 308. This applies to all of the periods within the front interval 306 except the first period 310 at the beginning of the front interval 306. Again this period 310 is not appended in order to avoid two consecutive identical periods, which would introduce an unintended periodicity. The same kind of processing is done for the front interval 312 of the diphone signal of diagram (a) and for the end interval 314 of the diphone signal of diagram (b). Further, the same approach is applied to the further diphones which are required to be concatenated for the synthesis of the word ‘young’.

Next a smoothening window is applied to the front, end, fade-in and fade-out intervals. For voiced segments a raised cosine is preferably used as a window function. The following window function is employed for the fade-in and front intervals:

$w[n] = 0.5 - 0.5 \cdot \cos\left( \frac{\pi \cdot (n + 0.5)}{m} \right), \quad 0 \leq n < m$

where m is the total number of periods in the smoothening range. The corresponding raised cosine is shown as raised cosine 316 in diagram (d). A corresponding window function is used to provide raised cosine 318 for the end and fade-out intervals 300 and 302. As illustrated in diagram (e), the durations of the intervals to be overlapped and added, i.e. intervals 300/308 and intervals 302/306, are rescaled in order to bring them to an equal length. The subsequent superposition of the required diphones provides the synthesis of the word ‘young’.

FIG. 4 shows a block diagram of computer system 400, which is a text-to-speech system. The computer system 400 has module 402 which serves to store diphones and markers for the diphones to indicate front and end intervals. Module 404 serves to repeat periods contained in the end and front intervals in inverted order in order to provide fade-in and fade-out intervals. Module 406 serves to provide a window function for windowing the end/fade-out and fade-in/front intervals for the purposes of smoothening. Module 408 serves for duration adaptation of the intervals to be superposed. Such a duration adaptation is required if the intervals to be superposed are not of equal length. Module 410 serves for the superposition of the end/fade-in and of the fade-out/front intervals in order to concatenate the required diphones. When text is entered into the computer system 400, the required diphones to be concatenated are selected from module 402. These diphones are processed by means of modules 404, 406 and 408 before they are overlapped and added by means of module 410, which results in the required synthesized speech signal.
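The chain of modules 404, 406 and 410 can be sketched for the voiced case as follows. This Python sketch rests on several assumptions that are not taken from the text: every period has the same number of samples (so the duration adaptation of module 408 is not needed), one window weight is applied per period, the fade-out window is the reversed fade-in window, and the function names are hypothetical.

```python
import math

def raised_cosine_window(m):
    # Fade-in weights w[n] = 0.5 - 0.5*cos(pi*(n + 0.5)/m), 0 <= n < m.
    return [0.5 - 0.5 * math.cos(math.pi * (n + 0.5) / m) for n in range(m)]

def join_units(end_periods, front_periods):
    """Overlap-add the joint of two speech units given as lists of periods.

    Assumptions: both intervals hold the same number of equal-length
    periods, and each period is a list of samples.
    """
    # Module 404: append periods in inverted order, skipping the last
    # period of the end interval and the first of the front interval.
    extended_a = end_periods + end_periods[-2::-1]      # end + fade-out
    extended_b = front_periods[:0:-1] + front_periods   # fade-in + front
    m = len(extended_a)
    if len(extended_b) != m:
        raise ValueError("intervals need duration adaptation (module 408)")
    # Module 406: one window weight per period of the smoothing range.
    fade_in = raised_cosine_window(m)
    fade_out = fade_in[::-1]  # assumed mirror image of the fade-in window
    # Module 410: superpose the windowed periods sample by sample.
    joint = []
    for n in range(m):
        joint.extend(fade_out[n] * a + fade_in[n] * b
                     for a, b in zip(extended_a[n], extended_b[n]))
    return joint

# Hypothetical steady intervals: three identical periods on each side.
period = [0.0, 1.0, 0.0, -1.0]
joint = join_units([period] * 3, [period] * 3)
```

Because the fade-out and fade-in weights sum to one for every period, two perfectly steady and identical intervals are joined without any change to the waveform; differences between the two recordings are cross-faded gradually over the smoothing range.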

1. A method of synthesizing a speech signal, the speech signal having at least a first speech unit and a second speech unit, the method comprising the steps of: providing a first speech unit signal, the first speech unit signal having an end interval, providing a second speech unit signal, the second speech unit signal having a front interval, appending of at least some periods of the end interval in inverted order at the end of the first speech unit signal to provide a fade-out interval, appending of at least some periods of the front interval in inverted order at the beginning of the second speech unit signal to provide a fade-in interval, superposing of the end and fade-in intervals and of the fade-out and front intervals.
2. The method of claim 1, whereby the end and front intervals have approximately steady periods.
3. The method of claim 1 or 2, the end and front intervals being identified by a marker.
4. The method of claim 1, whereby the last period of the end interval and the first period of the front interval are not appended.
5. The method of claim 1, further comprising windowing of the end and/or fade-out intervals with a fade-out window.
6. The method of claim 5, whereby a raised cosine is used as a fade-out window.
7. The method of claim 6, whereby the following window function is used for voiced intervals: $w[n] = 0.5 - 0.5 \cdot \cos\left( \frac{\pi \cdot (n + 0.5)}{m} \right), \quad 0 \leq n < m$, where m is the total number of periods in a smoothening range.
8. The method of claim 5, whereby a sine window is used as a fade-out window for unvoiced intervals.
9. The method of claim 8, whereby the following window function is used: $w[n] = \sin\left( \frac{0.5 \cdot \pi \cdot (n + 0.5)}{m} \right), \quad 0 \leq n < m$, where m is the total number of periods in a smoothening range.
10. The method of claim 1, the first and second speech units being diphones and/or triphones and/or polyphones, in particular words.
11. The method of claim 1, further comprising adapting the durations of the end and fade-in intervals and of the fade-out and front intervals.
12. The method of claim 1, whereby the speech signal is synthesized by means of an overlap and add operation.
13. Computer digital storage medium, comprising program means for synthesizing a speech signal, the speech signal having at least a first speech unit and a second speech unit, the program means being adapted to perform the steps of: providing a first speech unit signal, the first speech unit signal having an end interval, providing a second speech unit signal, the second speech unit signal having a front interval, appending of at least some periods of the end interval in inverted order at the end of the first speech unit signal to provide a fade-out interval, appending of at least some periods of the front interval in inverted order at the beginning of the second speech unit signal to provide a fade-in interval, superposing of the end and fade-in intervals and of the fade-out and front intervals.
14. Computer system, in particular text-to-speech system, for synthesizing a speech signal, the speech signal having at least a first speech unit and a second speech unit, the computer system comprising: means (402) for storing of a first speech unit signal, the first speech unit signal having an end interval, and for storing of a second speech unit signal, the second speech unit signal having a front interval, means (404) for appending of at least some periods of the end interval (202; 300) in inverted order at the end of the first speech unit signal to provide a fade-out interval (204; 302), means (404) for appending of at least some periods of the front interval (208; 306) in inverted order at the beginning of the second speech unit signal to provide a fade-in interval (308), means (410) for superposing of the end and fade-in intervals and of the fade-out and front intervals.